library(dplyr)   # for %>% and select()

Ames <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv',
                   header = TRUE,
                   sep = ',') %>%
  dplyr::select(-Id)

12: LASSO
This assignment is due on Monday, April 13th
All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment sufficiently early so that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative. Remember, if you use any AI for coding, you must comment each line with your own interpretation of what that line of code does.
Oh no. Really? Ames again?
Yes, Ames again. Let’s predict some SalePrices!
Data cleaning
Repeat the data cleaning exercise from last week’s lab. The point is to make sure that every observation is non-NA and that every predictor variable has more than one value. Use skimr::skim on Ames to find predictors that have only one value or that are missing many values. Take them out, then use na.omit to ensure there are no NA values left. Check to make sure you still have at least 800 or so observations!
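A minimal sketch of this cleaning pipeline, wrapped in a helper function so it’s easy to reuse. The 20% missingness threshold is illustrative, not a required value; inspect the skimr::skim output and use your own judgment about which columns to drop.

```r
library(dplyr)

# Drop single-valued predictors, mostly-missing predictors, then NA rows.
# max_missing = 0.2 is an illustrative cutoff, not an assignment requirement.
clean_predictors <- function(df, max_missing = 0.2) {
  df %>%
    select(-where(~ n_distinct(.x, na.rm = TRUE) <= 1)) %>%  # only one value
    select(-where(~ mean(is.na(.x)) > max_missing)) %>%       # too many NAs
    na.omit()                                                 # no NA rows left
}

# Ames_clean <- clean_predictors(Ames)
# nrow(Ames_clean)   # check: should still be at least ~800 observations
```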
Predictive model
For the assignment below, we’ll use glmnet::cv.glmnet to estimate a LASSO model. Note that you’re asked to use up to 16 predictors and at least 5 interactions; you can go beyond this. Unlike our linear model building, complexity in LASSO is not controlled by writing out a bunch of formulas with more and more terms - it is controlled by the lambda penalty parameter. So we write one formula and let lambda vary.
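The “one formula” idea can be sketched as follows. The predictor and interaction names below are hypothetical placeholders drawn from Ames, not required choices - pick your own from your cleaned data.

```r
# Hypothetical LASSO formula: the predictors and interactions here are
# placeholders for illustration; choose your own 16 predictors and 5+ interactions.
f <- SalePrice ~ GrLivArea + OverallQual + YearBuilt + LotArea + Neighborhood +
  GrLivArea:OverallQual + YearBuilt:OverallQual

f  # printing the formula object shows the model specification
```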
1. Clean your data as described above. Choose up to 16 predictor variables and clean your data so that no `NA` values are left.
2. Choose at least 5 interactions between your predictor variables and print out the formula you’ll use to predict `SalePrice`.
3. In your code, use `set.seed(24224)` so that your results will always be the same. Why do we need to set a seed? When we (well, `glmnet::cv.glmnet`) make the Train and Test sample(s), they are selected randomly. If you don’t set a seed, you’ll get slightly different answers every time you run it!
4. Use `glmnet::cv.glmnet` to estimate a LASSO model (see lecture notes this week) that predicts `SalePrice` given the observed data and using your formula. Slide 33 shows cross-validation using both `alpha` and `lambda` - a LASSO model holds `alpha` fixed at `alpha = 1`. We’ll search using `lambda` as our tuning parameter. Call the resulting object `net_cv`.
   - To do this, you’ll have to make a matrix to give to `cv.glmnet` in the `x` argument because it doesn’t take a formula. You can use `model.matrix()` to create the matrix; use that matrix as your `x`. It will not add the `SalePrice` variable to the `x` matrix - you just have to give it `SalePrice` as the `y` argument.
5. The resulting object will be a `cv.glmnet` object. You can see the optimal lambda just by printing the object with `print(net_cv)` and looking at the min value of Lambda. Make sure the optimal (RMSE-minimizing) value of lambda is not the largest or smallest value of lambda you gave it. If it is, extend the range of lambdas until you get an interior solution. Following the instructions from our lecture notes’ TRY IT, extract the lambdas and their respective RMSE values into a data.frame and make a plot similar to the RMSE plot from lecture.
6. Answer the following question: what is the optimal `lambda` based on the plot/data? Do you see a minimum point in the plot?
7. Extracting the non-zero coefficients is a little tricky, but let’s do it. We’ll use the `coef` function to extract the coefficients. The `coef` function, when used on a `glmnet` object, takes the argument `s`, which is the `lambda` value at which you’d like to extract coefficients. Our `s` value should be the best value of lambda, which we can extract from `net_cv$lambda.min`. Put those together: `coef(net_cv, s = net_cv$lambda.min)`. The output may be kinda long; that’s OK.
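The whole workflow above can be sketched end-to-end. To keep the sketch self-contained it runs on a small synthetic data set standing in for your cleaned Ames data; swap in your own data, formula, and predictors. It assumes the glmnet package is installed.

```r
library(glmnet)
set.seed(24224)  # so the cross-validation folds are reproducible

# Synthetic stand-in for the cleaned Ames data (your code uses Ames_clean)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)   # stand-in for SalePrice

# model.matrix() turns the formula into a numeric x matrix (drop the intercept column)
X <- model.matrix(y ~ x1 + x2 + x3 + x1:x2, data = dat)[, -1]

# alpha = 1 fixes a LASSO; cv.glmnet cross-validates over a grid of lambdas
net_cv <- cv.glmnet(x = X, y = dat$y, alpha = 1)

# Lambdas and their cross-validated RMSE, ready for a plot like the lecture's
cv_df <- data.frame(lambda = net_cv$lambda, rmse = sqrt(net_cv$cvm))

net_cv$lambda.min                      # optimal (CV-error-minimizing) lambda
coef(net_cv, s = net_cv$lambda.min)    # coefficients at the best lambda
```

Check that `net_cv$lambda.min` falls strictly inside the range of `cv_df$lambda`; if it sits at an endpoint, widen the lambda grid before interpreting the coefficients.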